The Pen Technologies group at IBM Research 
has recently been investigating methods for 
retrieving handwritten documents based on 
user queries. This paper investigates the use 
of typed and handwritten queries to retrieve 
relevant handwritten documents. The IBM 
handwriting recognition engine was used to 
generate N-best lists for the words in each of 
108 short documents. These N-best lists are 
concise statistical representations of the 
handwritten words. These statistical 
representations enable the retrieval methods 
to be robust when there are machine 
transcription errors, allowing retrieval of 
documents that would be missed by a 
traditional transcription-based retrieval 
system. Our experimental results demonstrate 
that significant improvements in retrieval 
performance can be achieved compared to 
standard keyword text searching of machine- 
transcribed documents. We have developed a 
software architecture for a multimedia 
document retrieval framework into which 
machine learning algorithms for feature 
extraction and matching may be easily 
integrated. The framework provides a “plug- 
and-play” mechanism for the integration of 
new media types, new feature extraction 
methods, and new document types. 


One of the most powerful benefits of electronic documents is the
ability to retrieve information automatically from a database on the
basis of some search criteria. The last few years have seen
accelerating progress in methods for multimedia retrieval, including
methods based on text meta-data attached to nontext media (e.g., text
annotations of speech and video) and methods based on automatic
extraction of nontext characteristics of multimedia that can be used
for nontext queries such as images. Some of these are based on
human-generated text descriptions and indexing of these documents,
some on automatic generation of text descriptions (e.g., face
recognition), and some on abstract query-by-example methods (e.g.,
locating images with color histograms similar to a sample image). 1,2
Progress in the last is clearly illustrated in the recent Multimedia
Content Description Interface, MPEG-7, work of the Moving Picture
Experts Group to standardize the description and representation of
multimedia features for retrieval purposes. 3

Speech, scanned text, and handwritten documents have been made more
accessible for retrieval by using machine learning algorithms. These
algorithms generate text transcriptions and then use conventional text
search technology to retrieve the corresponding nontext document.
Matches from the text search are then used to retrieve the original
scanned or speech documents. If precise transcripts of these documents
exist, information retrieval (IR) techniques can be applied; however,
such transcripts are typically too costly to generate by hand, and
machine learning methods for automating the process of transcript
generation are far from perfect. 4 Thus, such transcripts are usually
incomplete or corrupted by incorrect transcriptions.

©Copyright 2002 by International Business Machines Corporation.
Copying in printed form for private use is permitted without payment
of royalty provided that (1) each reproduction is done without
alteration and (2) the Journal reference and IBM copyright notice are
included on the first page. The title and abstract, but no other
portions, of this paper may be copied or distributed royalty free
without further permission by computer-based and other
information-service systems. Permission to republish any other portion
of this paper must be obtained from the Editor.

494 PERRONE, RUSSELL, AND ZIQ

0018-8670/02/$5.00 © 2002 IBM

IBM SYSTEMS JOURNAL, VOL 41, NO 3, 2002

It has been observed 5 that IR is not significantly degraded when the
documents to be retrieved are machine-printed documents that have been
transcribed using machine optical character recognition (OCR) methods.
Apparently, OCR of machine-printed documents is sufficiently accurate.
When transcription is inaccurate, word redundancy in the target
documents may compensate; 6 however, in general, sufficient word
redundancy cannot be assumed, especially for short documents.

The effect of transcription errors on retrieval has been addressed in
the context of speech. One approach 7 relies on query expansion, a
second approach 8 employs a variety of string distance methods, and a
third approach 9 uses global information about probable phoneme
confusions in the form of an average confusion matrix for all data
observed but does not handle confusions at the level of individual
word instances.

A class of successful approaches uses template matching between
handwritten queries and handwritten documents; 10-13 however, this
method can be very slow if the number of documents to be searched is
large and the match method is complex, and it does not allow for text
queries. Another approach 14 successfully used pieces of handwritten
words to handle inaccuracies in machine transcription. This approach
attempts to reduce the complexity of the transcription process at the
expense of allowing certain words to become ambiguous. As one might
expect, this approach was found to work well in domains in which words
were long and easily distinguishable but less well in domains with
many similar words.

Current search engines are fragmented in that each engine handles a
single media type, or a limited set of media types, and these engines
are not easily interoperable. Each time such a system is constructed,
similar design issues are revisited, repeating existing work and
leading to stand-alone systems that can utilize one another’s search
capabilities only after a considerable effort at integration. As a way
to avoid these problems, we are prototyping a flexible and extensible
Multimedia Document Retrieval (MDR) System. This MDR System will help
achieve four goals: provide a uniform search facility upon which
client applications can be developed without regard to media-specific
issues; provide a mechanism to streamline the implementation, testing,
and distribution of new multimedia search algorithms; provide a
framework for leveraging existing search algorithms; and provide a
means for naturally extending to “cross-media” searches, i.e.,
searches between two or more different media types.

In the next section, we address the issues facing multimedia document
retrieval by describing a flexible, extensible, “plug-and-play”
multimedia document retrieval framework. In the third section, we
describe specific details of the MDR approach when applied to the task
of handwritten document retrieval. In the fourth section, we describe
handwritten document retrieval experiments based on using the IBM
handwriting recognition engine to construct pattern objects. In the
fifth section, we present the results of these handwritten document
retrieval experiments. The last section summarizes our findings.

The MDR System 

Our prototype, the Multimedia Document Retrieval (MDR) System, shown
in Figure 1, performs two basic functions: the indexing of multimedia
documents and the retrieval-by-query of indexed multimedia documents.
We begin by giving a high-level overview of indexing and retrieval and
then present component details.

Indexing. Before retrieval is possible, an index must be built. The
index is built in the following manner. A user interacts with an
indexing client and requests that multimedia documents be indexed. The
indexing client is responsible for verifying that the corresponding
media types are supported, locating and retrieving the documents from
the media database, and passing them to the index builder along with
any user-specified indexing preferences. The index builder is
responsible for converting the documents it receives into statistical
representations that it then stores in the pattern database. Documents
are processed by passing them to a media decomposer that is
responsible for decomposing a multimedia document into its constituent
“primitive” media elements (e.g., an MPEG file might be decomposed
into audio and video). Once a document is decomposed into primitive
media elements, the media elements are then passed to their
corresponding pattern builders that are responsible for generating
statistical representations of the media. These representations are
called pattern objects. The pattern objects are passed back to the
index builder, which then adds them to the pattern database.

Figure 1 The MDR System

Retrieval-by-query. Once an index has been built, a user may search
the corresponding documents by submitting a multimedia query to the
query client. The query client is responsible for constructing
queries, validating that the media types in the query are supported,
and submitting the query to the query engine. The query engine uses
the media decomposer and pattern builders to convert the query into
pattern objects, which are then compared to the pattern database
entries using a pattern similarity metric. The comparison process
results in a relevance score for each indexed document. These scores
are then passed to the query client, which retrieves documents from
the media database ranked by their relevance to the query.

Component details. The media database may be any repository or set of
repositories where media documents are stored. The documents
themselves are identified using URLs (uniform resource locators) and
may consequently be stored in a local or remote file system, database,
FTP (File Transfer Protocol) site, or Web server.

The most general goal of the MDR System is to enable retrieval of
documents of any media type using queries of any media type. This goal
requires the ability to compare content in various media types. What
is needed are compact media representations for which a measure of
approximate match is easy to calculate. In the MDR System these
representations are the pattern objects. A pattern object has four
attributes: a URL, an extent, pattern data, and a pattern similarity
metric. The URL points to the document from which the pattern object
is derived. The extent describes which subset of the document
corresponds to the pattern object. The pattern data is some
representation of the media within the extent. The pattern similarity
metric measures the similarity of two pattern objects. The pattern
objects are the core of the MDR System and therefore must be chosen
with care. For nontext media such as handwriting, speech, and video,
machine learning algorithms are needed to assist in the construction
of pattern objects.
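
As an illustration only, the four attributes of a pattern object could
be modeled as follows. This is a minimal sketch, not the MDR System's
actual classes: the class name `PatternObject`, the `(start, end)`
encoding of the extent, the example URLs, and the toy `overlap` metric
are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class PatternObject:
    """One pattern object with the four attributes named in the text."""
    url: str                               # document the pattern was derived from
    extent: Tuple[int, int]                # assumed encoding: (start, end) within the document
    data: Any                              # pattern data, e.g. an N-best list
    metric: Callable[[Any, Any], float]    # similarity of two pattern-data values

    def similarity(self, other: "PatternObject") -> float:
        """Compare this pattern object with another using this object's metric."""
        return self.metric(self.data, other.data)


def overlap(a: dict, b: dict) -> float:
    """Toy metric (an assumption): sum of score products over shared words."""
    return sum(score * b[word] for word, score in a.items() if word in b)


# Hypothetical handwritten word and typed query.
doc = PatternObject("file:///notes/page1.ink", (0, 42), {"cat": 0.4, "cut": 0.6}, overlap)
query = PatternObject("query:typed", (0, 1), {"cat": 1.0}, overlap)
```

Keeping the metric on the object means the query engine can compare
two pattern objects without knowing which medium produced them.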

A pattern builder is the component of the MDR System that converts
specific primitive media elements into sets of pattern objects.
Machine learning algorithms can be used to convert the media into
statistical representations of the media. Different pattern builders
may be designed to extract the same pattern object type from different
media types, thus enabling cross-media search and retrieval.

The pattern objects are stored in a pattern object database and are
retrieved using database queries that are themselves converted into
pattern objects, which search the database for pattern objects similar
to themselves.

The retrieval process determines the relevance value of each document
to a user-specified query. The documents indexed by the pattern object
database can then be sorted and retrieved based on these values. In
general, queries are in disjunctive normal form and may also include
additional constraints on meta-data such as creation time or
authorship. The MDR System represents these complex queries as trees
in which the leaf nodes contain the media elements and the parent
nodes contain the relation information.

The query engine contains the functionality for obtaining query
results, using the query tree and the pattern database. Complex
queries are processed by passing relevance scores from leaf nodes up
the tree structure to the root node, merging relevance estimates
according to the relation information built into the tree. The pattern
database indices are used to prune the number of pattern objects that
must be examined with the similarity metric. In addition to this
low-level search optimization, standard database query optimization
techniques may be applied to the query tree.
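
The bottom-up merging of relevance scores can be sketched as a small
recursive evaluation. The names `Leaf`, `Node`, and `evaluate` are
hypothetical. Multiplying child scores for an "and" relation follows
the multiword scoring described later in this paper. Taking the
maximum for an "or" relation is purely an assumption for illustration,
since the text does not specify that merge rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Leaf:
    """Leaf node: per-document relevance scores for one media element."""
    scores: Dict[str, float]


@dataclass
class Node:
    """Parent node: holds the relation ("and" or "or") over its children."""
    relation: str
    children: List[Union["Node", Leaf]] = field(default_factory=list)


def evaluate(node, doc_ids):
    """Pass relevance scores from the leaves up to this node."""
    if isinstance(node, Leaf):
        return {d: node.scores.get(d, 0.0) for d in doc_ids}
    child_scores = [evaluate(child, doc_ids) for child in node.children]
    merged = {}
    for d in doc_ids:
        values = [scores[d] for scores in child_scores]
        if node.relation == "and":
            product = 1.0
            for v in values:
                product *= v          # "and": multiply child scores
            merged[d] = product
        elif node.relation == "or":
            merged[d] = max(values)   # "or": assumed merge rule
        else:
            raise ValueError(f"unknown relation: {node.relation}")
    return merged


# Disjunctive normal form: (term1 AND term2) OR term3
tree = Node("or", [
    Node("and", [Leaf({"d1": 0.5, "d2": 0.2}), Leaf({"d1": 0.4})]),
    Leaf({"d2": 0.3}),
])
result = evaluate(tree, ["d1", "d2"])
```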

Plug-and-play. A major goal of the MDR System is to create a “media
agnostic” framework into which media-specific components can be easily
and seamlessly incorporated. The MDR System allows researchers and
developers to focus on the media-specific aspects of their work, while
taking advantage of the media-independent services, i.e., query trees
and merge methods. Additionally, the MDR framework provides abstract
classes that a developer extends and implements as needed for specific
media types.

Example: Handwritten media 

In the context of the MDR System, we have investigated the document
retrieval performance of several pattern objects for handwriting. In
this case, the media elements are handwritten documents.

Pattern object: The N-best list. This paper uses a statistical
classifier to convert each word of a handwritten document into a set
of word/score pairs, one for each of the most likely text translations
of the handwritten word. 15 This approach is robust when there are
transcription errors (which is of particular importance for
low-frequency words) and has the ability to retrieve words that are
not in the lexicon of the machine transcription system.

This set of scores is termed an “N-best list.” In practice, each
handwritten word in each document is converted into an N-best list.
This step need only be done once. Each word of a handwritten query is
likewise converted into an N-best list. For a text query, each word is
converted into a trivial N-best list by giving a maximum score to the
query word and a minimum score to all other N-best list entries. (That
is, we assume no noise in recording a text query, though this
assumption could easily be relaxed.)

Let $W$ be the set of all possible words and let $X$ be a given
handwritten occurrence of $w \in W$. We define the N-best list
associated with $X$ as the vector $S(X) = (S_1(X), S_2(X), \ldots)$,
where $S_i(X)$ is the score of $X$ given $w_i$, the $i$th word of $W$,
according to some machine transcription system. In this paper we used
an HMM (Hidden Markov Model) 16 trained on an unconstrained,
writer-independent data set to calculate $S_i(X)$ as a measure of the
HMM’s probability of $X$ given $w_i$. In practice, we set a threshold
for $S_i(X)$ to disregard low scores, which results in N-best lists
averaging approximately 16 nonzero entries. For the rest of the paper,
we drop explicit reference to $X$.
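
A minimal sketch of the two ways an N-best list arises, assuming
sparse storage as a word-to-score mapping. The function names, the
threshold value, and the maximum score of 100 are illustrative
assumptions. The raw scores stand in for the output of a recognizer,
which is not modeled here.

```python
def nbest_list(raw_scores, threshold):
    """Keep only entries at or above the threshold, sorted by descending score.

    Entries below the threshold are treated as zero and simply omitted.
    """
    kept = {word: s for word, s in raw_scores.items() if s >= threshold}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


def text_query_nbest(word, max_score=100.0):
    """Trivial N-best list for a typed query word.

    The query word gets the maximum score; every other word has the
    minimum score (assumed to be zero) and is therefore not stored.
    """
    return {word: max_score}


# Hypothetical recognizer scores for one handwritten word.
raw = {"cut": 100, "cot": 95, "cat": 94, "lot": 10, "let": 5}
handwritten = nbest_list(raw, threshold=50)
typed = text_query_nbest("cat")
```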

In standard handwriting recognition systems, the N-best list is the
final result of the recognition process and is used to indicate the
correct transcription of the handwritten word. In some systems, if the
first word in the N-best list is incorrect, a user can optionally
select another word from the N-best list.

Motivation for N-best-list patterns. N-best list retrieval compensates
for transcription noise by allowing the search for words to go beyond
the best match from the transcription process, i.e., the top word in
the N-best list. For example, suppose a typed query of “cat” is used
and the handwriting that should be retrieved has the following N-best
list (sorted by score):


Word     cut    cot    cat    lot    let    ...
Score    100     95     94     10      5    ...


If only the top-scoring word (“cut”) is used, the correct handwriting
will not be retrieved; however, if the N-best list is used, a
well-designed method could use the additional information to detect
that the corresponding handwriting is more closely related to the
query than the simple transcription would suggest.

N-best lists can also correct for transcription noise caused by words
that are unknown to the transcription model (e.g., proper names,
symbols from foreign languages, or even nontext handwriting such as
arrows, circles, doodles, etc.). In many transcription systems, such
words cannot appear in an N-best list. However, if a writer writes
consistently, the statistical structure of the N-best lists of various
handwriting instances of the same thing should be similar, and that
similarity should yield a good match of N-best lists.

One might think that the addition of so many words through the use of
N-best lists would add significant amounts of noise to the retrieval
process; however, the likelihood of retrieving the wrong document is
not significantly increased by the use of N-best lists. This can be
understood by considering the following: Typically an individual
writer’s set of query words is much smaller than the set of all
possible words; therefore, the likelihood is low that a transcription
error will lead to a high score for a query word in an N-best list
that does not correspond to that query word. Thus, incorrect document
retrieval, i.e., a high score for the right word in the wrong N-best
list, is not significantly increased as a result of the use of an
N-best list. Of course, if queries have very similar handwriting
representations (e.g., “puppy” and “puppet”), cross-confusion may be a
problem.

False positives occur when the transcription engine gives a high word
score to handwriting that is incorrectly transcribed. Incorrectly
transcribed words other than query words rarely generate false
positives, since with a large vocabulary of known words, the engine
will more often select words other than the query word for the N-best
list.

In summary, the N-best list, combined with knowledge of the behavior
of the HMM, provides a more comprehensive description of the document,
while still facilitating effective search techniques.

N-best list similarity metrics. Our N-best list retrieval methods work
by defining a metric between the N-best list from a query and the
N-best lists from a database. The metric scores for each N-best list
in a given document are combined to generate a relevance score for the
whole document. Documents are then ranked by their relevance scores
and retrieved in rank order. The relative retrieval performance using
various metrics can then be examined. We now define some metrics for
which we have done experiments.

Text metric. Handwritten documents can be retrieved using conventional
text searches on machine-transcribed text, with links back to the
original handwritten documents. We used this method as the baseline
for our experiments. The text transcription for each document is
simply the text assembled by taking the highest-scoring word from each
N-best list. The search terms are ASCII strings, taken from the
hand-generated ground truth. The metric score is one or zero,
depending on whether the query word matches any document word. The
document score is the sum of the metric scores.

Ranked text metric. The traditional text search can be enhanced by
including other words from the N-best list generated by the
transcription model. As a simple trial, we took the top three words
from the N-best list and weighted them solely by rank: 1.0 for the top
word, $\alpha$ for the second word, and $\beta$ for the third word,
where $\alpha$ and $\beta$ were greater than zero and optimized on an
independent data set. We then searched through this expanded document
for the single ASCII search term. The metric score is the rank score
of matching words. The document score is the sum of the metric scores.
This metric score will always be equal to or greater than the text
metric, since the top word still has a weight of 1.0. However, the
contributions of the other words cause additional documents to have
nonzero scores and can change the rank ordering of the documents. This
metric is very convenient and powerful because it requires very little
information from the transcription model. Only the first three
candidates need be stored and indexed, and no score information is
needed.
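
The baseline text metric and the ranked text metric can be sketched
together. Each handwritten word is represented here by its candidate
words in rank order. The function names are hypothetical, and the
second and third rank weights (0.5 and 0.25) are placeholders, since
the optimized values are not given in the text.

```python
def text_metric(query, doc_nbest_words):
    """Baseline: count handwritten words whose top candidate equals the query."""
    return sum(1.0 for words in doc_nbest_words if words and words[0] == query)


def ranked_text_metric(query, doc_nbest_words, weights=(1.0, 0.5, 0.25)):
    """Weight a match by its rank among the top candidates (no scores needed).

    weights[0] is fixed at 1.0 in the text; the other two are placeholders.
    """
    score = 0.0
    for words in doc_nbest_words:
        for rank, word in enumerate(words[:len(weights)]):
            if word == query:
                score += weights[rank]
                break
    return score


# A hypothetical two-word document; each inner list is one N-best list.
document = [
    ["cut", "cot", "cat", "lot", "let"],
    ["sat", "cat", "set"],
]
baseline = text_metric("cat", document)
ranked = ranked_text_metric("cat", document)
```

With these placeholder weights the baseline misses “cat” entirely,
while the ranked metric gives the document a nonzero score, which is
the behavior the paragraph above describes.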

Scored text metric. In this metric, the expanded document includes up
to 20 words from the N-best list. Each word is weighted proportionally
to the score assigned it by the transcription model, normalized so
that the scores for all the alternate words for one piece of
handwriting sum to 1.0. The metric score is the matching word score,
and the document score is the sum of the metric scores.
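
A minimal sketch of the scored text metric, assuming each N-best list
is a sequence of (word, score) pairs sorted by descending score. The
function name and the example scores are illustrative.

```python
def scored_text_metric(query, doc_nbest_lists, max_words=20):
    """Sum, over handwritten words, the normalized score of the query word.

    Only the top `max_words` candidates are considered, and their scores
    are normalized to sum to 1.0 for each piece of handwriting.
    """
    total = 0.0
    for nbest in doc_nbest_lists:
        top = nbest[:max_words]
        norm = sum(score for _, score in top)
        if norm == 0:
            continue                  # nothing to normalize; contributes zero
        for word, score in top:
            if word == query:
                total += score / norm
                break
    return total


document = [[("cut", 100), ("cot", 95), ("cat", 94), ("lot", 10), ("let", 5)]]
score = scored_text_metric("cat", document)
```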

Dot-product metric. The dot-product metric between a query N-best
list, $q$, and a document N-best list, $d$, is given by

$$\cos(q, d) = \frac{q \cdot d}{|q|\,|d|} \qquad (1)$$

which is always between 0 and 1 since the N-best list scores are
nonnegative.

From the N-best list perspective, the dot product is the sum of the
products of the normalized scores of words that appear in both the
query N-best list and a document word N-best list. The score for a
document is the sum of the dot products of all the N-best lists in the
document with the N-best list for the query handwriting.
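
The dot-product metric can be sketched over sparse N-best lists,
assuming it is the cosine of the angle between the two score vectors
(as the name cos(q, d) suggests), with each vector normalized by its
Euclidean length. The function names are hypothetical.

```python
import math


def cosine(q, d):
    """Cosine similarity of two sparse N-best lists (word -> score)."""
    dot = sum(score * d[word] for word, score in q.items() if word in d)
    norm_q = math.sqrt(sum(s * s for s in q.values()))
    norm_d = math.sqrt(sum(s * s for s in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0                    # an empty list matches nothing
    return dot / (norm_q * norm_d)


def dot_product_document_score(query, doc_nbest_lists):
    """Document score: sum of cosines between the query and every document word."""
    return sum(cosine(query, d) for d in doc_nbest_lists)


query = {"cat": 1.0}
document = [{"cut": 3.0, "cat": 4.0}, {"dog": 2.0}]
doc_score = dot_product_document_score(query, document)
```

Only words present in both lists contribute to the numerator, so the
cost of one comparison grows with the length of the shorter list.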

Experiments 

The data used in these experiments were collected from 108 writers,
each of whom was asked to select one of three topics and hand-write a
one-page document. The categories were “thank you note,” “room
description,” and “defective product letter.” Within a category, the
writers were free to write whatever they wanted. The documents were
then transcribed by hand. The result was a database of 108 documents
with a total of 10985 words. These documents were collected using the
CrossPad** notebook files and the IBM InkManager* software.

The data set size in this experiment is much smaller than typical
databases used in text retrieval research, though it is typical of
available on-line handwriting recognition databases. However, from a
speed and resources point of view, we expect our system to scale
almost identically to keyword retrieval of text documents because
standard indexing methods can be used equally well with the N-best
lists. From a retrieval accuracy point of view, we do not know whether
the performance gains will increase or decrease as the data set size
scales, but we believe it will continue to outperform keyword search
because of the error-tolerant nature of the algorithms.

Each document was processed by a heuristic clusterer 17 to group
handwriting data. Handwriting clusters were then normalized and had
features extracted for processing by a multistate, Bakis-topology,
lexeme-based HMM. 16 For each word in each cluster, the HMM returned
an N-best list of the top-scoring words from a 30000-word lexicon. The
HMM was trained in a writer-independent fashion on data from over 200
writers. Thus, each document in our database was converted into a set
of word N-best lists that summarize the scores of the HMM for the most
likely words for each cluster of handwriting.

The total number of words in all N-best lists after this process was
approximately 176000, of which approximately 28000 were unique out of
a total possible 30000-word lexicon.

Query generation procedure. The handwritten data used for this paper
were not originally collected for the purpose of testing IR
techniques, and no attempt was made to collect separate query
handwriting or even to identify what queries would be appropriate.
Thus we were faced with two tasks: identifying plausible query words
and obtaining the corresponding handwriting.

Identifying query words. Query words were selected based on word
frequencies and availability of other renderings of the same word.
These queries were taken from documents from other writers. Ideally,
we would have used query words written by the writer of each letter;
however, those data were not collected since this data set was
originally collected for another purpose and, after the fact, the
writers were not available to write additional words as queries. Thus,
the results presented here are lower than one would expect if the same
writer had written both the queries and the documents to be retrieved.

We take a standard approach 18,19 to query selection. We define
tf(t, d), the term frequency, as the number of occurrences of a term t
within a given document d. We define idf(t), the inverse document
frequency, as the inverse of the number of documents in the database
that contain a term t. We selected queries for each document by
choosing the terms in the document that had the highest
tf(t, d)*idf(t) product, subject to the constraint that the query
words exist in the ground truth for at least one other document, to
ensure that handwriting is available outside the target document for
use as a query. We chose the five words from each document having the
highest tf*idf product subject to this constraint.
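
The selection rule can be sketched as follows, using the definitions
in the text: tf is the raw count within the document and idf is one
over the number of documents containing the term. The function name,
the tie-breaking by alphabetical order, and the tiny example corpus
are assumptions for illustration.

```python
from collections import Counter


def select_query_words(docs, target, k=5):
    """Pick the k terms of `target` with the highest tf*idf product.

    docs maps a document id to its list of ground-truth words. A term is
    eligible only if it also occurs in at least one other document.
    """
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))       # count each document once per term
    tf = Counter(docs[target])
    # The target contains the term, so doc_freq >= 2 means another document does too.
    candidates = [t for t in tf if doc_freq[t] >= 2]
    # Sort by descending tf * idf, where idf = 1 / doc_freq; ties broken alphabetically.
    candidates.sort(key=lambda t: (-tf[t] / doc_freq[t], t))
    return candidates[:k]


docs = {
    "a": ["toaster", "toaster", "broken", "the", "refund"],
    "b": ["the", "toaster", "works"],
    "c": ["the", "refund", "the"],
}
queries = select_query_words(docs, "a", k=2)
```

In this toy corpus “broken” is excluded because it occurs in no other
document, so no query handwriting would be available for it.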

The various document scores were: the word count for text metric, the
sum of the rank weights for ranked text metric, the sum of the metrics
for scored text metric, and the sum of the dot products for
dot-product metric. We did not use any weighting of query words (e.g.,
Okapi 20,21) since the query words were chosen based on tf*idf, so the
differences in tf*idf would be modest. Because of the small size of
the documents, we do not expect high query word redundancy in any of
the documents.

Obtaining handwritten queries. Since our document database did not
include word-level ground truth, we relied on the output of our
handwriting recognizer to identify query handwriting from documents
other than the one for which the query word was selected. From all of
the documents that included a query word in their ground truth, we
selected the N-best list with the highest score for the query word.
Because of errors in machine transcription, it is possible that some
query handwriting corresponded to a word other than the desired query
word. Thus, the transcription-based algorithms may retrieve documents
using the wrong query handwriting, making the results more pessimistic
than they would be in practice.

Since the query word is drawn from one of the documents in the
database, single-word queries include as part of their “truth set,”
i.e., the set of documents that are correct to retrieve, the document
from which the word was drawn. Consequently, the reported single-word
retrieval performance may be optimistic. For the multiword queries,
the documents that the query words are drawn from generally drop out
of the truth set. Thus we expect that the results for the multiword
queries may actually be slightly pessimistic because the document from
which the word was drawn will have an elevated score due to an exact
match to one of the query words. This artificially elevated score
triggers a false positive.

Document scoring for multiword queries. A previous paper 15 reported
the behavior of N-best list retrieval of single words. Here, we focus
on multiword handwritten queries of multiword handwritten documents.
For multiword queries, the score for a document was determined by
multiplying the single term scores for the document. This process
corresponds to an “and” operation over all query words. In order to
prevent documents from dropping out if a single query word was
missing, a small positive number was added to all word scores. This
addition introduced a very large penalty for missing query words, but
the document would still stay in the result list. Primarily affected
is the high recall region of the precision-recall curves.

Precision and recall are used to measure the retrieval performance of
the system. Precision is the percentage of the documents retrieved by
a query that are correct, and recall is the percentage of all correct
documents that are retrieved by a query. Below, we define these terms
more precisely.

For each document $d$, we calculate a relevance score to query $q$.
Let the truth set of $q$ be the set of documents whose ground truth
text contains the ground truth of $q$. Let $n(q)$ be the number of
documents that are in the truth set of query $q$. Let $n_r(q, \theta)$
be the number of documents with a relevance score to $q$ above a
threshold, $\theta$. Let $n_c(q, \theta)$ be the number of documents
with a relevance score to $q$ above $\theta$ which are also in the
truth set of $q$.

Using these definitions, we define precision and recall as follows:

$$\mathrm{Recall}(\theta) = \frac{n_c(q, \theta)}{n(q)} \qquad (2)$$

$$\mathrm{Precision}(\theta) = \frac{n_c(q, \theta)}{n_r(q, \theta)} \qquad (3)$$

Figure 2 Single word query performance over all 540 one-word queries

Figure 3 Two-word query performance averaged over all 1080 two-word queries



The precision-recall curves implicitly parameterized by the threshold
$\theta$ are shown in Figures 2, 3, 4, 5, and 6. Only results of
queries using the same number of words were averaged. One-word queries
had a truth set size of 3.9, on average, and a variance of 2.3. On
average, two-word queries had about 1.2 correct results per query.
Most had one correct result, a significant number had two correct
results, eight queries had three correct results, and one query had
four correct results.
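
The multiword scoring rule (a product of single-term scores, each
offset by a small positive number) and the precision and recall
definitions can be sketched as below. The function names and the
offset value 1e-3 are illustrative assumptions. Returning a precision
of 1.0 when nothing is retrieved is a convention chosen for the
sketch, since the ratio is undefined in that case.

```python
def multiword_score(term_scores, epsilon=1e-3):
    """'And' over query words: multiply single-term scores.

    epsilon keeps a document in the result list when one term is missing,
    at the cost of a large penalty.
    """
    score = 1.0
    for s in term_scores:
        score *= s + epsilon
    return score


def precision_recall(relevance, truth_set, theta):
    """Precision and recall at relevance threshold theta.

    relevance maps document id -> relevance score; truth_set is non-empty.
    """
    retrieved = {d for d, s in relevance.items() if s > theta}
    correct = retrieved & truth_set
    recall = len(correct) / len(truth_set)
    precision = len(correct) / len(retrieved) if retrieved else 1.0
    return precision, recall


# Hypothetical single-term scores for a two-word query against three documents.
relevance = {
    "d1": multiword_score([0.9, 0.8]),   # both words match
    "d2": multiword_score([0.9, 0.0]),   # one word missing
    "d3": multiword_score([0.0, 0.0]),   # both words missing
}
truth = {"d1", "d2"}
high = precision_recall(relevance, truth, theta=0.1)
low = precision_recall(relevance, truth, theta=0.0)
```

Sweeping theta from high to low traces out one precision-recall curve
for the query.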

Averaging query results. Five words were selected from the actual text
of each document for use as query words. For each retrieval method, a
query was performed with every possible combination of the five query
words from each document. This generated five single-word queries, ten
two-word queries, ten three-word queries, five four-word queries, and
one five-word query from each document in the database.
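
The query counts follow directly from enumerating subsets of the five
query words. A short sketch, with the helper name `all_queries` and
the placeholder words chosen only for illustration:

```python
from itertools import combinations


def all_queries(query_words):
    """Map each query length k to every k-word subset of the query words."""
    return {
        k: list(combinations(query_words, k))
        for k in range(1, len(query_words) + 1)
    }


queries = all_queries(["w1", "w2", "w3", "w4", "w5"])
per_document = [len(queries[k]) for k in range(1, 6)]
# Totals over the 108 documents in the database.
totals = [108 * n for n in per_document]
```

The totals (540, 1080, 1080, 540, and 108) agree with the query counts
given in the captions of Figures 2 through 6.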

The ground truth for each query was determined simply by finding
whether all the query words used for a given query appeared in the
actual text of each of the documents. Thus the document truth set
varies depending on which subset of the five query words is used. Each
query had at least one correct result document, since each group of
five query words was chosen from the actual text of one document.

For a fixed number of query words, retrieval results were averaged
over all possible combinations of subsets of that size from the five
query words. This averaging helps to reduce retrieval performance
variability that may occur because of the inherent variability of the
handwriting representations, and to account for the fact that an
individual may attempt to retrieve a document using one of a variety
of different queries.

Metric optimization. We explored optimizing the scored text metric and
the dot-product metric by replacing them with simple functions of the
original scores, based on N-best list rank. We optimized candidate
functions using an independent data set of simple single word queries
of words in a small database of approximately 1100 handwritten word
samples from 78 writers. The area enclosed by the precision-recall
curve (see earlier discussion on document scoring in this section)
obtained by dot-product queries in the 1100 word database was used as
the optimization criterion, a reasonable overall measure of retrieval
performance.

We optimized by substituting the score at each rank
with a linear function of the score:

$s'_i = \alpha_i s_i + \beta_i$

where $s_i$ is the original score at rank $i$, $s'_i$ is
the new score, and $\alpha_i$ and $\beta_i$ are the global
parameters that were optimized. For this case, we ran
a Monte Carlo optimization of a few thousand trials,
concentrating the variation in the parameters for the
higher ranks. We then looked at the sets of parameters
that generated the best results, averaged them, and
rounded them off.
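
The procedure can be sketched as a random search over the per-rank parameters of the linear rescaling (new score = alpha_i * original score + beta_i at rank i). Everything specific in this sketch is an assumption: the perturbation schedule (widest at rank 1), the number of best trials averaged, the rounding precision, and the toy objective, which stands in for the precision-recall area criterion. All function names are hypothetical.

```python
import random


def rescale(scores, alphas, betas):
    """Apply the linear rescaling alpha_i * s_i + beta_i at each rank i."""
    return [a * s + b for s, a, b in zip(scores, alphas, betas)]


def monte_carlo_search(objective, n_ranks, trials=2000, keep=10, seed=0):
    """Random search; average and round the best-scoring parameter sets."""
    rng = random.Random(seed)
    # Assumed schedule: widest perturbation at rank 1, narrower further down.
    spreads = [1.0 / (i + 1) for i in range(n_ranks)]
    results = []
    for _ in range(trials):
        alphas = [1.0 + rng.uniform(-s, s) for s in spreads]
        betas = [rng.uniform(-s, s) for s in spreads]
        results.append((objective(alphas, betas), alphas, betas))
    # Keep the trials with the highest objective value.
    results.sort(key=lambda r: r[0], reverse=True)
    best = results[:keep]
    n = len(best)
    avg_alphas = [round(sum(r[1][i] for r in best) / n, 1) for i in range(n_ranks)]
    avg_betas = [round(sum(r[2][i] for r in best) / n, 1) for i in range(n_ranks)]
    return avg_alphas, avg_betas


def toy_objective(alphas, betas):
    """Stand-in objective; a real run would score retrieval performance."""
    target = [1.5, 1.0, 1.0]
    miss = sum((a - t) ** 2 for a, t in zip(alphas, target))
    return -miss - sum(b * b for b in betas)


alphas, betas = monte_carlo_search(toy_objective, n_ranks=3)
rescaled = rescale([0.9, 0.6, 0.3], alphas, betas)
print(alphas, betas, rescaled)
```

In an actual run, `objective` would rescale the N-best scores with the candidate parameters, execute the held-out queries, and return the resulting area under the precision-recall curve.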


Figure 4 Three-word query performance averaged over all 
1080 three-word queries 



Figure 5 Four-word query performance averaged over all 
540 four-word queries 



The N-best list dot-product metric with rescaled scores
appears in the results as “Rescaled Dot-Product 
Metric,” and the scored text metric with rescaled 
scores appears as “Rescaled Scored Text Metric.” 

Results 

Figures 2, 3, 4, 5, and 6 show the average
precision-recall curves for one-, two-, three-, four-,
and five-word queries. Each graph contains one curve
for each of the six retrieval methods (as discussed
previously in the third section). The text metric in
each graph may be considered the baseline performance
against which the other metrics should be compared.


Figure 6 Five-word query performance averaged over all
108 five-word queries


Single-word queries. Inspection of the precision-recall
curves shows that the different algorithms produce
curves of substantially different shape, so the
choice of best algorithm depends on the regime. In
the low-recall/high-precision regime for single-term
queries (Figure 2), the rescaled scored text metric
search had the best average performance. This
advantage persists up to about 70 percent recall,
beyond which the rescaled dot-product search
performs best. Note that this regime dependence
diminishes as the query word count increases.

With the exception of the rescaled scored text metric
method, the text metric is on par with or better
than all other search strategies in the regime up
to 50 percent recall. All other strategies are
superior at recall levels above 70 percent.

Multiword queries. In general, performance improved
dramatically as more query terms were added. The
improvement is partly a result of the reduction
in the truth set size, which is the denominator
of recall. It can also be seen from the graphs in
Figures 2-6 that as the number of query words increases,
the regime dependence observed in the single-word
queries diminishes, so much so that for five-word
queries there are no crossovers between the baseline
method and the other methods. Also note that as
the number of query words increases, the baseline
method (text metric) gradually falls in relative
performance until it is the worst-performing method.

Conclusions 

We have described the MDR System and have shown
how its “media agnostic” approach leads to a
flexible, extensible, plug-and-play system for
multimedia and cross-media information retrieval.

In the context of the MDR System, we have developed
IR algorithms for handwritten documents that
are suitable for both typed and handwritten queries.
We have demonstrated that document expansion using
N-best lists can provide improvements in precision
and recall compared to the simple text metric, even
using the most lightweight method (ranked text
metric), and that these improvements persist for
multiword queries. Additional improvements were
illustrated using more complex metrics. The improved
searchability that we have demonstrated has the
potential to make handwritten databases much more
valuable to users.

The methods described in this paper have the
additional benefit that they are text-based. Unlike
template matching methods, these methods can leverage
much of the existing text IR technology and
enable one-time preprocessing and indexing of an
N-best list. Furthermore, the approaches presented
here do not rely on word redundancy to overcome
transcription errors. This is borne out by the
positive results we obtained on the very short documents
that made up the database used. Finally,
dot-product methods have the potential to retrieve words
or symbols that are not in the transcription lexicon.

We are currently working to improve and extend 
both the MDR System and our handwriting-specific 
components. 